{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ur8xi4C7S06n"
      },
      "outputs": [],
      "source": [
        "# Copyright 2025 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JAPoU8Sm5E6e"
      },
      "source": [
        "# BigQuery DataFrames ML: Prescription Drug Name Generation\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fapplying-llms-to-data%2Fbigquery_dataframes_ml_drug_name_generation.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/bigquery/v1/32px.svg\" alt=\"BigQuery Studio logo\"><br> Open in BigQuery Studio\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/bigquery_dataframes_ml_drug_name_generation.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>            "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a3a54161ab79"
      },
      "source": [
        "| | |\n",
        "|-|-|\n",
        "|Author(s) | [Ashley Xu](https://github.com/ashleyxuu) |"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "24743cf4a1e1"
      },
      "source": [
        "**_NOTE_**: This notebook has been tested in the following environment:\n",
        "\n",
        "* Python version = 3.9"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvgnzT1CKxrO"
      },
      "source": [
        "## Overview\n",
        "\n",
        "The goal of this notebook is to demonstrate an enterprise generative AI use case. A marketing user can provide information about a new pharmaceutical drug and its generic name, and receive ideas on marketing-oriented brand names for that drug.\n",
        "\n",
        "Learn more about [BigQuery DataFrames](https://cloud.google.com/bigquery/docs/dataframes-quickstart)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d975e698c9a4"
      },
      "source": [
        "### Objective\n",
        "\n",
        "In this tutorial, you learn about Generative AI concepts such as prompting and few-shot learning, as well as how to use BigFrames ML for performing these tasks simply using an intuitive dataframe API.\n",
        "\n",
        "The steps performed include:\n",
        "\n",
        "1. Ask the user for the generic name and usage for the drug.\n",
        "1. Use `bigframes` to query the FDA dataset of over 100,000 drugs, filtered on the brand name, generic name, and indications & usage columns.\n",
        "1. Filter this dataset to find prototypical brand names that can be used as examples in prompt tuning.\n",
        "1. Create a prompt with the user input, general instructions, examples and counter-examples for the desired brand name.\n",
        "1. Use [`bigframes.ml.llm.GeminiTextGenerator`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm.GeminiTextGenerator) to generate choices of brand names."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "08d289fa873f"
      },
      "source": [
        "### Dataset\n",
        "\n",
        "This notebook uses the [FDA dataset](https://cloud.google.com/blog/topics/healthcare-life-sciences/fda-mystudies-comes-to-google-cloud) available at [`bigquery-public-data.fda_drug`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1sbigquery-public-data!2sfda_drug)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aed92deeb4a0"
      },
      "source": [
        "### Costs\n",
        "\n",
        "This tutorial uses billable components of Google Cloud:\n",
        "\n",
        "* BigQuery (compute)\n",
        "* BigQuery ML\n",
        "\n",
        "Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),\n",
        "and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),\n",
        "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n",
        "to generate a cost estimate based on your projected usage."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i7EUnXsZhAGF"
      },
      "source": [
        "## Installation\n",
        "\n",
        "Install the following packages required to execute this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2b4ef9b72d43"
      },
      "outputs": [],
      "source": [
        "%pip install --upgrade --quiet bigframes google-cloud-bigquery-connection"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "58707a750154"
      },
      "source": [
        "### Restart current runtime\n",
        "\n",
        "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f200f10a1da3"
      },
      "outputs": [],
      "source": [
        "# Automatically restart kernel after installs so that your environment can access the new packages\n",
        "import IPython\n",
        "\n",
        "app = IPython.Application.instance()\n",
        "app.kernel.do_shutdown(True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZleVuNh9rtqj"
      },
      "source": [
        "<div class=\"alert alert-block alert-warning\">\n",
        "<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>\n",
        "</div>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "960505627ddf"
      },
      "source": [
        "### Import libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PyQmSRbKA8r-"
      },
      "outputs": [],
      "source": [
        "from IPython.display import Markdown\n",
        "from bigframes.ml.llm import GeminiTextGenerator\n",
        "import bigframes.pandas as bpd\n",
        "from google.cloud import bigquery_connection_v1 as bq_connection"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sBCra4QMA2wR"
      },
      "source": [
        "### Authenticate your Google Cloud account\n",
        "\n",
        "Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "74ccc9e52986"
      },
      "source": [
        "**1. Vertex AI Workbench**\n",
        "* Do nothing as you are already authenticated."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "de775a3773ba"
      },
      "source": [
        "**2. Local JupyterLab instance, run:**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "254614fa0c46"
      },
      "outputs": [],
      "source": [
        "! gcloud auth login"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ef21552ccea8"
      },
      "source": [
        "**3. Colab, run:**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "603adbbf0532"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "if \"google.colab\" in sys.modules:\n",
        "    from google.colab import auth\n",
        "\n",
        "    auth.authenticate_user()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BF1j6f9HApxa"
      },
      "source": [
        "## Before you begin\n",
        "\n",
        "### Set up your Google Cloud project\n",
        "\n",
        "**The following steps are required, regardless of your notebook environment.**\n",
        "\n",
        "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
        "\n",
        "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
        "\n",
        "3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com).\n",
        "\n",
        "4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WReHDGG5g0XY"
      },
      "source": [
        "#### Set your project ID\n",
        "\n",
        "**If you don't know your project ID**, try the following:\n",
        "* Run `gcloud config list`.\n",
        "* Run `gcloud projects list`.\n",
        "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oM1iC_MfAts1"
      },
      "outputs": [],
      "source": [
        "# Use the environment variable if the user doesn't provide Project ID.\n",
        "import os\n",
        "\n",
        "PROJECT_ID = \"[your-project-id]\"  # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n",
        "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n",
        "    PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n",
        "\n",
        "# Set the project id\n",
        "! gcloud config set project {PROJECT_ID}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t5w0GGdYsDjS"
      },
      "source": [
        "### Load the Gemini 2.0 Flash model\n",
        "\n",
        "To learn more about all [Gemini models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NrUfvucbauu0"
      },
      "outputs": [],
      "source": [
        "MODEL_ID = \"gemini-2.0-flash-001\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "evsJaAj5te0X"
      },
      "source": [
        "#### BigFrames configuration\n",
        "\n",
        "Next, we will specify a [BigQuery connection](https://cloud.google.com/bigquery/docs/working-with-connections). If you already have a connection, you can simplify provide the name and skip the following creation steps."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "G1vVsPiMsL2X"
      },
      "outputs": [],
      "source": [
        "# Please fill in these values.\n",
        "LOCATION = \"us\"  # @param {type:\"string\"}\n",
        "CONNECTION = \"bigframes-ml\"  # @param {type:\"string\"}\n",
        "\n",
        "connection_name = f\"{PROJECT_ID}.{LOCATION}.{CONNECTION}\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WGS_TzhWlPBN"
      },
      "source": [
        "We will now try to use the provided connection, and if it doesn't exist, create a new one. We will also print the service account used."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "56Hw42m6kFrj"
      },
      "outputs": [],
      "source": [
        "# Initialize client and set request parameters\n",
        "client = bq_connection.ConnectionServiceClient()\n",
        "CONNECTION_NAME = client.connection_path(PROJECT_ID, LOCATION, CONNECTION)\n",
        "\n",
        "# Try to connect using provided connection\n",
        "try:\n",
        "    response = client.get_connection(name=CONNECTION_NAME)\n",
        "# Create a new connection on error\n",
        "except Exception:\n",
        "    connection = bq_connection.Connection(\n",
        "        friendly_name=CONNECTION, cloud_resource=bq_connection.CloudResourceProperties()\n",
        "    )\n",
        "    response = client.create_connection(\n",
        "        request=bq_connection.CreateConnectionRequest(\n",
        "            parent=client.common_location_path(PROJECT_ID, LOCATION),\n",
        "            connection_id=CONNECTION,\n",
        "            connection=connection,\n",
        "        )\n",
        "    )\n",
        "finally:\n",
        "    CONN_SERVICE_ACCOUNT = (\n",
        "        f\"serviceAccount:{response.cloud_resource.service_account_id}\"\n",
        "    )\n",
        "\n",
        "print(CONN_SERVICE_ACCOUNT)\n",
        "\n",
        "# Set service account permissions\n",
        "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/bigquery.connectionUser'\n",
        "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/aiplatform.user'\n",
        "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/run.invoker'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "init_aip:mbsdk,all"
      },
      "source": [
        "### Initialize BigFrames client\n",
        "\n",
        "Here, we set the project configuration based on the provided parameters."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OCccLirpkSRz"
      },
      "outputs": [],
      "source": [
        "bpd.options.bigquery.project = PROJECT_ID\n",
        "bpd.options.bigquery.location = LOCATION"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m8UCEtX9uLn6"
      },
      "source": [
        "## Generate a name\n",
        "\n",
        "Let's start with entering a generic name and description of the drug."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oxphj2gnuKou"
      },
      "outputs": [],
      "source": [
        "GENERIC_NAME = \"Entropofloxacin\"  # @param {type:\"string\"}\n",
        "USAGE = \"Entropofloxacin is a fluoroquinolone antibiotic that is used to treat a variety of bacterial infections, including: pneumonia, streptococcus infections, salmonella infections, escherichia coli infections, and pseudomonas aeruginosa infections It is taken by mouth or by injection. The dosage and frequency of administration will vary depending on the type of infection being treated. It should be taken for the full course of treatment, even if symptoms improve after a few days. Stopping the medication early may increase the risk of the infection coming back.\"  # @param {type: \"string\"}\n",
        "NUM_NAMES = 10  # @param {type: \"integer\"}\n",
        "TEMPERATURE = 1.0  # @param {type: \"number\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1q-vlbalzu1Q"
      },
      "source": [
        "We can now create a prompt string, and populate it with the name and description."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0knz5ZWMzed-"
      },
      "outputs": [],
      "source": [
        "zero_shot_prompt = f\"\"\"Provide {NUM_NAMES} unique and modern brand names in Markdown bullet point format. Do not provide any additional explanation.\n",
        "\n",
        "Be creative with the brand names. Don't use English words directly; use variants or invented words.\n",
        "\n",
        "The generic name is: {GENERIC_NAME}\n",
        "\n",
        "The indications and usage are: {USAGE}.\"\"\"\n",
        "\n",
        "print(zero_shot_prompt)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LCRE2L720f5y"
      },
      "source": [
        "Next, let's create a helper function to predict with our model. It will take a string input, and add it to a temporary BigFrames `DataFrame`. It will also return the string extracted from the response `DataFrame`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LB3xgDroIxlx"
      },
      "outputs": [],
      "source": [
        "def predict(\n",
        "    model: GeminiTextGenerator, prompt: str, temperature: float = TEMPERATURE\n",
        ") -> str:\n",
        "    # Create DataFrame\n",
        "    input = bpd.DataFrame(\n",
        "        {\n",
        "            \"prompt\": [prompt],\n",
        "        }\n",
        "    )\n",
        "\n",
        "    # Return response\n",
        "    return model.predict(\n",
        "        input, temperature=temperature\n",
        "    ).ml_generate_text_llm_result.iloc[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "b1ZapNZsJW2p"
      },
      "source": [
        "We can now initialize the model, and get a response to our prompt!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UW2fQ2k5Hsic"
      },
      "outputs": [],
      "source": [
        "# Get BigFrames session\n",
        "session = bpd.get_global_session()\n",
        "\n",
        "# Define the model\n",
        "model = GeminiTextGenerator(\n",
        "    model_name=MODEL_ID, session=session, connection_name=connection_name\n",
        ")\n",
        "\n",
        "# Invoke LLM with prompt\n",
        "response = predict(model, zero_shot_prompt)\n",
        "\n",
        "# Print results as Markdown\n",
        "Markdown(response)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "o3yIhHV2jsUT"
      },
      "source": [
        "We're off to a great start! Let's see if we can refine our response."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mBroUzWS8xOL"
      },
      "source": [
        "## Few-shot prompting\n",
        "\n",
        "Let's try using [few-shot prompting](https://www.promptingguide.ai/techniques/fewshot). We will provide a few examples of what we're looking for along with our prompt.\n",
        "\n",
        "Our prompt will consist of 3 parts:\n",
        "* General instructions (e.g. generate $n$ brand names)\n",
        "* Multiple examples\n",
        "* Information about the drug we'd like to generate a name for\n",
        "\n",
        "Let's walk through how to construct this prompt.\n",
        "\n",
        "Our first step will be to define how many examples we want to provide in the prompt."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MXdI78SOElyt"
      },
      "outputs": [],
      "source": [
        "# Specify number of examples to include\n",
        "\n",
        "NUM_EXAMPLES = 3  # @param {type: \"integer\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U8w4puVM_892"
      },
      "source": [
        "Next, let's define a prefix that will set the overall context."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aQ2iscnhF2cx"
      },
      "outputs": [],
      "source": [
        "prefix_prompt = f\"\"\"Provide {NUM_NAMES} unique and modern brand names in Markdown bullet point format, related to the drug at the bottom of this prompt.\n",
        "\n",
        "Be creative with the brand names. Don't use English words directly; use variants or invented words.\n",
        "\n",
        "First, we will provide {NUM_EXAMPLES} examples to help with your thought process.\n",
        "\n",
        "Then, we will provide the generic name and usage for the drug we'd like you to generate brand names for.\n",
        "\"\"\"\n",
        "\n",
        "print(prefix_prompt)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VI0Spv-axN7d"
      },
      "source": [
        "Our next step will be to include examples into the prompt.\n",
        "\n",
        "We will start out by retrieving the raw data for the examples, by querying the BigQuery public dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IoO_Bp8wA07N"
      },
      "outputs": [],
      "source": [
        "# Query 3 columns of interest from drug label dataset\n",
        "df = bpd.read_gbq(\n",
        "    \"bigquery-public-data.fda_drug.drug_label\",\n",
        "    col_order=[\"openfda_generic_name\", \"openfda_brand_name\", \"indications_and_usage\"],\n",
        ")\n",
        "\n",
        "# Exclude any rows with missing data\n",
        "df = df.dropna()\n",
        "\n",
        "# Drop duplicate rows\n",
        "df = df.drop_duplicates()\n",
        "\n",
        "# Print values\n",
        "df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W5kOtbNGBTI2"
      },
      "source": [
        "Let's now filter the results to remove atypical names."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "95WDe2eCCeLx"
      },
      "outputs": [],
      "source": [
        "# Remove names with spaces\n",
        "df = df[df[\"openfda_brand_name\"].str.find(\" \") == -1]\n",
        "\n",
        "# Remove names with 5 or fewer characters\n",
        "df = df[df[\"openfda_brand_name\"].str.len() > 5]\n",
        "\n",
        "# Remove names where the generic and brand name match (case-insensitive)\n",
        "df = df[df[\"openfda_generic_name\"].str.lower() != df[\"openfda_brand_name\"].str.lower()]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FZD89ep4EyYc"
      },
      "source": [
        "Let's take `NUM_EXAMPLES` samples to include in the prompt."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2ohZYg7QEyJV"
      },
      "outputs": [],
      "source": [
        "# Take a sample and convert to a Pandas dataframe for local usage.\n",
        "df_examples = df.sample(NUM_EXAMPLES, random_state=5).to_pandas()\n",
        "\n",
        "df_examples"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oU4mb1Dwgq64"
      },
      "source": [
        "Let's now load the examples data into a prompt. For consistency, we'll capitalize each example brand name.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kzAVsF6wJ93S"
      },
      "outputs": [],
      "source": [
        "example_prompt = \"\\n\\n---\\n\\n\".join(\n",
        "    f\"Generic name: {generic_name}\\nUsage: {usage}\\nBrand name: {brand_name.capitalize()}\"\n",
        "    for brand_name, generic_name, usage in zip(\n",
        "        df_examples[\"openfda_brand_name\"],\n",
        "        df_examples[\"openfda_generic_name\"],\n",
        "        df_examples[\"indications_and_usage\"],\n",
        "    )\n",
        ")\n",
        "example_prompt"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kbV2X1CXAyLV"
      },
      "source": [
        "Finally, we can create a suffix to our prompt. This will contain the generic name of the drug, its usage, ending with a request for brand names."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OYp6W_XfHTlo"
      },
      "outputs": [],
      "source": [
        "suffix_prompt = f\"\"\"\\n\n",
        "---\n",
        "\n",
        "Generic name: {GENERIC_NAME}\n",
        "Usage: {USAGE}\n",
        "Brand names:\"\"\"\n",
        "\n",
        "print(suffix_prompt)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RiaisW1nihJP"
      },
      "source": [
        "Let's pull it altogether into a few-shot prompt."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "99xdU7l8C1h8"
      },
      "outputs": [],
      "source": [
        "# Define the prompt\n",
        "few_shot_prompt = prefix_prompt + example_prompt + suffix_prompt\n",
        "\n",
        "# Print the prompt\n",
        "print(few_shot_prompt)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nbUWdHtfitWn"
      },
      "source": [
        "Now, let's pass our prompt to the LLM, and get a response!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "d4ODRJdvLhlQ"
      },
      "outputs": [],
      "source": [
        "response = predict(few_shot_prompt)\n",
        "\n",
        "Markdown(response)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pFakjrTElOBs"
      },
      "source": [
        "# Batch generation\n",
        "\n",
        "Let's take these experiments to the next level by generating many names in a batch. We'll see how to leverage BigFrames at scale!\n",
        "\n",
        "We can start by finding drugs that are missing brand names. There are approximately 4,000 drugs that meet this criteria. We'll put a limit of 100 in this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8eAutS41mx6U"
      },
      "outputs": [],
      "source": [
        "# Query 3 columns of interest from drug label dataset\n",
        "df_missing = bpd.read_gbq(\n",
        "    \"bigquery-public-data.fda_drug.drug_label\",\n",
        "    col_order=[\"openfda_generic_name\", \"openfda_brand_name\", \"indications_and_usage\"],\n",
        ")\n",
        "\n",
        "# Exclude any rows with missing data\n",
        "df_missing = df_missing.dropna()\n",
        "\n",
        "# Include rows in which openfda_brand_name equals openfda_generic_name\n",
        "df_missing = df_missing[\n",
        "    df_missing[\"openfda_generic_name\"].str.lower()\n",
        "    == df_missing[\"openfda_brand_name\"].str.lower()\n",
        "]\n",
        "\n",
        "# Limit the number of rows for demonstration purposes\n",
        "LIMIT = 10\n",
        "df_missing = df_missing.head(LIMIT)\n",
        "\n",
        "# Print values\n",
        "df_missing.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Fm6L8S7eVnCI"
      },
      "source": [
        "We will create a column `prompt` with a customized prompt for each row."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "19TvGN1PVmVX"
      },
      "outputs": [],
      "source": [
        "df_missing[\"prompt\"] = (\n",
        "    \"Provide a unique and modern brand name related to this pharmaceutical drug.\"\n",
        "    + \"Don't use English words directly; use variants or invented words. The generic name is: \"\n",
        "    + df_missing[\"openfda_generic_name\"]\n",
        "    + \". The indications and usage are: \"\n",
        "    + df_missing[\"indications_and_usage\"]\n",
        "    + \".\"\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "njxwBvCKgMPE"
      },
      "source": [
        "We'll create a new helper method, `batch_predict()` and query the LLM. The job may take a couple minutes to execute."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tiSHa5B4aFhw"
      },
      "outputs": [],
      "source": [
        "def batch_predict(\n",
        "    model: GeminiTextGenerator, input: bpd.DataFrame, temperature: float = TEMPERATURE\n",
        ") -> bpd.DataFrame:\n",
        "    return model.predict(input, temperature=temperature).ml_generate_text_llm_result\n",
        "\n",
        "\n",
        "response = batch_predict(model, df_missing[\"prompt\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "K5a2nHdLgZEj"
      },
      "source": [
        "Let's check the results for one of our responses!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TnizdeqBdbZj"
      },
      "outputs": [],
      "source": [
        "# Pick a sample\n",
        "k = 0\n",
        "\n",
        "# Gather the prompt and response details\n",
        "df_missing = df_missing.head(LIMIT).reset_index(drop=True)\n",
        "response = response.head(LIMIT).reset_index(drop=True)\n",
        "\n",
        "prompt_generic = df_missing[\"openfda_generic_name\"][k]\n",
        "prompt_usage = df_missing[\"indications_and_usage\"][k]\n",
        "response_str = response[k]\n",
        "\n",
        "# Print details\n",
        "print(f\"Generic name: {prompt_generic}\")\n",
        "print(f\"Brand name: {prompt_usage}\")\n",
        "print(f\"Response: {response_str}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W4MviwyMI-Qh"
      },
      "source": [
        "Congratulations! You have learned how to use generative AI to jump-start the creative process.\n",
        "\n",
        "You've also seen how BigFrames can manage each step of the process, including gathering data, data manipulation, and querying the LLM."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Bys6--dVmq7R"
      },
      "source": [
        "## Cleaning up\n",
        "\n",
        "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n",
        "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
        "\n",
        "Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cIODjOLump_-"
      },
      "outputs": [],
      "source": [
        "# Delete the BigQuery Connection\n",
        "from google.cloud import bigquery_connection_v1 as bq_connection\n",
        "\n",
        "client = bq_connection.ConnectionServiceClient()\n",
        "client.delete_connection(name=CONNECTION_NAME)\n",
        "print(f\"Deleted connection {CONNECTION_NAME}.\")"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "bigquery_dataframes_ml_drug_name_generation.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
